Probabilistic Event Categorization 
Abstract

This paper describes the automation of a 
new text categorization task. The cate- 
gories assigned in this task are more syntac- 
tically, semantically, and contextually com- 
plex than those typically assigned by fully 
automatic systems that process unseen test 
data. Our system for assigning these cate- 
gories uses a probabilistic classifier, devel- 
oped with a recent method for formulat- 
ing a probabilistic model from a predefined 
set of potential features (Bruce 1995, Bruce 
and Wiebe 1994, Pedersen et al. 1996). 
This paper focuses on feature selection. It 
presents various types of properties experi- 
mented with in this work. We identify and 
evaluate various approaches to organizing 
the collocational properties into features. 
With the more complex features we define, 
there is an organization that yields the best 
results; but the same organization with 
less complex features yields inferior results. 
The results suggest a way to take advan- 
tage of properties that are low frequency 
but strongly indicative of a class. The
problem of recognizing and organizing the
various kinds of contextual information
required to perform a linguistically complex
categorization task has rarely been
systematically investigated in NLP.

1 Introduction 

This paper reports findings on performing automatic 
event categorization, i.e., recognizing high-level se- 
mantic classes of the main state or event that a
clause is about. The event categorization addressed
in this paper is new. In this classification scheme, 
the event reported by the main clause of a sentence 
is categorized as being either: (1) a private state 
(the clause is about, e.g., a belief, emotion, or per- 
ception), (2) a speech event (the clause is about, 
e.g., a saying or declaring event), or (3) other (the 
clause is about another kind of state or event). The 
speech-event category is divided into subcategories 
based on how the event is presented syntactically 
and how much of what was said is presented in the 
sentence. The language used to describe private 
states and speech events is rich and varied, includ- 
ing idiomatic and metaphorical expressions (Barn- 
den 1992). There is a large amount of syntactic 
and part-of-speech variation, and the categorization 
is context dependent. Although the categories are 
complex, it has been demonstrated in an inter-coder 
reliability study (Wiebe and Bruce 1997) that these 
classifications can be performed with high reliability 
by human judges. 

The method we use to automate this task is proba- 
bilistic classification. We perform an explicit model 
search to find a model that provides a good char- 
acterization of the relationships among the targeted 
classification and properties in the data (Bruce 1995, 
Bruce and Wiebe 1994, Pedersen et al. 1997). Doing 
so is in contrast to one common practice in NLP of 
assuming a certain model form, such as n-gram and 
Naive Bayesian models, without testing how well 
those models fit the data. In the experiments re- 
ported on here, the models identified as best for the 
task being performed vary in structure in response to 
the type of features used, supporting the usefulness 
of performing model search. The method permits
the use of many features of different kinds, including
n-gram properties as well as those types of fea-
tures typically included in Maximum Entropy mod- 
els (Berger et al. 1996) and Decision Trees (Breiman 
et al. 1994). In addition, as in Decision Tree induc- 
tion (Breiman et al. 1994), feature selection can be 
performed as part of the process of model formula- 
tion. 

We experimented with many different kinds of 
properties to perform the classification task. These 
properties are presented in this paper. They are de- 
termined fully automatically, and range from shal- 
low surface characteristics (e.g., word counts and 
word co-occurrence) to more syntactically complex 
structures (e.g., an adjective serving as subject com- 
plement) as well as discourse features (e.g., whether 
or not the sentence is the first one in a para- 
graph). Many of the properties would be applicable 
to other event categorization and information ex- 
traction tasks for which one event out of many in a 
sentence is targeted, or for which the classifications 
are highly context dependent. 

We also experimented with various ways to orga- 
nize collocational properties into features, including 
properties that are often used in word-sense disam- 
biguation systems. With the more complex proper- 
ties we define, there is an organization that yields 
the best results, but with the less complex prop- 
erties, the same organization yields inferior results. 
The results suggest a way to take better advantage 
of properties that are low frequency but strongly in- 
dicative of a class. 

In addition to such factors as the form of the 
model and the method used to choose collocations, 
the organization used for collocational information is 
another experimental parameter that can affect per- 
formance of an NLP system that uses collocational 
information. 

A preprocessor was developed to determine the 
properties according to which the classifications are 
made. It is composed of off-the-shelf components 
and new components. The new components and 
pointers to the existing ones will be available over 
the World Wide Web. The annotation instructions, 
the results of the intercoder-reliability study, and ta- 
bles of feature values for experimentation will also be 
available at that site. 

The remainder of this paper is organized as follows.
The method used for model selection is described in
section 2. The results of the experiments are given
up front in section 3 and then discussed in subsequent
sections. Section 4 details the properties experimented
with, and section 5 presents different possible
organizations of contextual information into features.
Section 6 discusses the results, and section 7 is the
conclusion.



2 The Method 

We use a supervised learning method for automati- 
cally formulating probabilistic models for use in clas- 
sification, where a classifier is induced from a corpus 
of tagged data. Suppose there is a training sample in 
which each sentence is represented by the variables 
$(F_1, \ldots, F_{n-1}, S)$. Variables $(F_1, \ldots, F_{n-1})$
correspond to properties of the sentence and the context
in which it appears, and variable S is the classifica- 
tion variable, the variable that corresponds to the 
classification being made. Our task is to induce a 
classifier that will assign a value for S, given the 
values that the feature variables have for this sen- 
tence. 

We adopt a statistical approach whereby a prob- 
abilistic model is selected that describes the inter- 
actions among the feature variables. This approach 
is described fully elsewhere (Bruce 1995, Bruce and 
Wiebe 1994, Pedersen et al. 1996). Such a model 
can form the basis of a probabilistic classifier since 
it specifies the probability of observing any and all 
combinations of the values of the feature variables. 

In the fully saturated model, all variables are in- 
terdependent, and the parameters of the model cor- 
respond to combinations of values of all of the vari- 
ables in the model. If the data sample can be ade- 
quately characterized by a less complex model, i.e., 
a model in which there are fewer interactions be- 
tween variables, then more reliable parameter esti- 
mates can be obtained. How well a model character- 
izes the training sample is determined by measuring 
the fit of the model to the sample, i.e., how well 
the distribution defined by the model matches the 
distribution observed in the training sample. 

A good strategy for developing probabilistic clas- 
sifiers is to perform an explicit model search to select 
the model to use in classification. The model selec- 
tion algorithm used here performs a backward se- 
quential search (a type of greedy search) of the class 
of decomposable models, a class of models that have 
many computational advantages (Whittaker 1990). 

The search begins by designating the saturated model
as the current model. At each stage, we generate the
set of decomposable models of complexity level $i - 1$
that can be created by removing an edge from the
current model of complexity level $i$. The evaluation
criterion is applied to each of these models to
determine which yields the least degradation in fit
from the current model. If the degradation is within
limits established by the evaluation criterion, this
model becomes the current model and the search
continues. Otherwise, the search stops. For a further
discussion of search strategies and evaluation
criteria, see Pedersen et al. (1997).
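The greedy loop described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the `degradation` scoring callback, the `threshold` stopping rule, and the `is_decomposable` check are all hypothetical placeholders. In particular, the default `is_decomposable` accepts every graph, so a real decomposability test would have to be supplied for models with more than three variables.

```python
from itertools import combinations


def backward_search(variables, degradation, threshold,
                    is_decomposable=lambda edges: True):
    """Greedy backward sequential search over undirected graphical models.

    A model is a frozenset of edges; each edge is a frozenset of two variables.
    `degradation(current, candidate)` is a caller-supplied (hypothetical)
    loss-of-fit score, for example a difference in G^2.
    """
    # Start from the saturated model: every pair of variables is connected.
    current = frozenset(frozenset(pair) for pair in combinations(variables, 2))
    while current:
        # Every model that is exactly one edge simpler than the current one.
        candidates = [current - {edge} for edge in current]
        candidates = [m for m in candidates if is_decomposable(m)]
        if not candidates:
            break
        best = min(candidates, key=lambda m: degradation(current, m))
        if degradation(current, best) > threshold:
            break  # even the least harmful removal degrades the fit too much
        current = best
    return current


# Toy illustration: the cost of dropping an edge is a fixed, made-up weight.
toy_weights = {frozenset("AB"): 0.1, frozenset("AS"): 5.0, frozenset("BS"): 0.2}


def toy_degradation(current, candidate):
    (removed,) = current - candidate
    return toy_weights[removed]


selected = backward_search("ABS", toy_degradation, threshold=1.0)
```

In the toy run, the two cheap edges are removed and the search stops at the edge between A and S, whose removal would exceed the threshold.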

The model selection process also performs feature
selection. If a model is selected where there is no
edge connecting a feature variable to the
classification variable, then that feature has been,
in essence, dropped from the classifier. The
log-likelihood ratio statistic $G^2$ (Bishop et al.
1975) is used as the model evaluation criterion in
all of the experiments.
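For readers unfamiliar with the statistic, the sketch below computes the log-likelihood ratio statistic in its standard form, two times the sum of observed counts times the log of observed over expected counts, for the simplest case: a two-way contingency table tested against the model of independence. The function name and the table layout are assumptions made for illustration; the model search itself compares richer decomposable models.

```python
import math


def g2_independence(table):
    """G^2 = 2 * sum(O * ln(O / E)) for a two-way table under independence.

    `table` is a list of rows of observed counts. Expected counts come from
    the row and column totals. Cells with a zero count contribute nothing.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                total += observed * math.log(observed / expected)
    return 2.0 * total


independent = g2_independence([[10, 10], [10, 10]])
dependent = g2_independence([[20, 0], [0, 20]])
```

A value near zero means the independence model fits well; larger values mean a poorer fit.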

3 The Experiments and Results 

This section presents the results of the comparative
experiments performed in this paper. The properties
used to form features are presented in section 4,
and the various organizations of collocational
information are given and discussed in section 5.

After a large amount of background experimentation,
the best experiment we found involves: (1) four
non-collocational features (those labeled the Current
Best in section 4), and (2) the collocational
properties labeled Syntactic Patterns in section 4,
organized as per-class-2, which is described in
section 5. A feature was judged to be good if, after
the model search procedure has completed, that
feature is still included in (one of) the model(s)
with the highest accuracy.

Since our interests are to investigate the relative 
goodness of the various collocational patterns and 
of the organizations, we varied only these factors, 
and used the same set of non-collocational features 
throughout. 

The total amount of data consists of 2,544 main 
clauses from the Wall Street Journal Treebank cor- 
pus (Marcus et al. 1993). The distribution of classes 
over the entire data set is shown in table 1. The
lower bound for the problem, i.e., the frequency in
the entire data set of the most frequent class (Gale
et al. 1992b), is 52%.

For clearer understanding of the factors covaried 
in the experiments presented in table 2, the model 
search procedure was not permitted to drop any fea- 
tures from the model. 

10-fold cross-validation was performed. For each
fold, the collocations were determined and model
search was performed anew. Each fold is a different
split between 1/10th testing data (TestData) and
9/10th training data (TrainingData). For each fold,
TrainingData was further split into 9/10th training
data (SearchData; 81% of the total data) and 1/10th
test data (SelectionData; 9% of the total data).
Model search was performed on SearchData, and the
model M with the highest accuracy on SelectionData
was selected. Finally, the accuracy, precision, and
recall of model M on the real test set, TestData,
were determined; the results presented in table 2 are
the averages of those results over all of the folds.
Thus, which model to choose as best is based on a
search-selection split of the training data, and the
results are reported on separate, held-out test data.
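The nested split can be illustrated with a short sketch. The paper does not say how items were assigned to folds, so the deterministic striding below is an assumption, as are the function and variable names. The point is only to show how the 81% / 9% / 10% proportions arise.

```python
def nested_folds(n_items, k=10):
    """Yield (search, selection, test) index lists for each of k folds."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        training = [i for i in range(n_items) if i % k != fold]
        # Hold out every k-th training item for choosing among models.
        selection = training[::k]
        search = [i for pos, i in enumerate(training) if pos % k != 0]
        yield search, selection, test


# 2,544 main clauses, as in the experiments reported here.
for search, selection, test in nested_folds(2544):
    total = len(search) + len(selection) + len(test)
    print(len(search), len(selection), len(test),
          round(len(search) / total, 2), round(len(selection) / total, 2))
```

Model search would run on `search`, the competing models would be compared on `selection`, and only the chosen model would be scored on `test`.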

In table 2, rows correspond to the organizations
defined in section 5 (PC-1 for per-class-1; PC-2 for
per-class-2; OR-1 for over-range-1; and OR-2 for
over-range-2). Columns correspond to collocation
types.

A better result than any in the table was obtained 
in a separate experiment, in which some hand-tuning 
of the collocational features was performed: over 
78% by manually grouping some related information 
into features. 

4 Properties 

The properties we experimented with are given in 
this section, along with brief indications of the pre- 
processing required to determine them. Many are 
similar to the kinds of surface properties suggested 
by Hearst (1992) and Light (1996). Some are based 
on properties found to be correlated with similar 
classes in the literature; others are based on observ- 
ing the tagged training data; and others were chosen 
based on intuition and the fact that the preprocessor 
is able to determine them (such as the tense of the 
main verb). 

The Treebank syntax trees were used for only one
purpose: to identify the main clause of the sentence.
The main clause must be identified only because we
define the problem as classifying the main clause.
The features could easily be adapted to any clause,
whether or not it is the main clause.

The main verb of the clause to be classified is the 
pivot of some of the properties. We adopt Quirk 
et al.'s definition of a main verb (1985), and use a 
finite-state machine to skip over the various types 
of auxiliaries and identify the main verb automati- 
cally. In identifying and applying the collocational 
properties listed below, the morphological analyzer 
described in Karp et al. (1992) is used to match the 
root forms of words, and Brill's tagger (Brill 1992) 
is used to assign parts of speech. 

We begin with the non-collocational properties, 
listing first those from the best experiment we found. 
Listed second are properties that were chosen in 
some experiment for inclusion in the most accurate 
model. This occurred either on the current data with 
a different subset of features than those in the best 
experiment, or on an earlier version of the annotated 
data. In this earlier version of the problem defini- 
tion, the annotations were less context-sensitive, and 
the task was more like traditional word-sense disam- 
biguation. Listed third are those we did not succeed 
with. 

Table 1: Distribution of Classes

Class                                                  Percentage of the Corpus
Private state                                          10%
Speech category 1: direct speech                       09%
Speech category 2: mixed direct and indirect speech    04%
Speech category 3: other speech event                  24%
Borderline private state and other event               01%
Other state or event                                   52%

Table 2: 10-fold Results Varying Collocation Type and Feature Organization

        Co-occurrence Patterns          Within-5 Patterns               Syntactic Patterns
        Accuracy  Precision  Recall     Accuracy  Precision  Recall     Accuracy  Precision  Recall
OR-1    0.6838    0.6967     0.9815     0.6020    0.6144     0.9799     0.7039    0.7056     0.9976
OR-2    0.7063    0.7164     0.9858     0.7082    0.7147     0.9909     0.7114    0.7158     0.9937
PC-1    0.5315    0.5364     0.9906     0.5550    0.5568     0.9969     0.7382    0.7431     0.9933
PC-2    0.6500    0.6571     0.9886     0.6567    0.6604     0.9945     0.7468    0.7495     0.9965

4.1 Non-Collocational Properties
4.1.1 The Current Best

The following non-collocational features are the
best we found for the current problem.

1. Whether or not the sentence begins a new
paragraph. Paragraphs are already delimited in the
Treebank corpus.

Psychological experiments have shown a correlation
between paragraph breaks and point of view sentences
(Stark 1987, Bruder and Wiebe 1990). That this is one
of the best features lends further support to those
findings.

2. Percentage of the sentences so far in the current
paragraph that the system classified as private-
state or speech-event sentences. The value is 1 if
this proportion is greater than 0.3, and 0 otherwise.
The goodness of this feature also gives evidence
for the importance of the paragraph as a unit
for this problem.

3. Define a quote ratio, R = N/M, where N is 
the number of words which are within quotation 
marks in the sentence, and M is the total num- 
ber of words in the sentence. There are three 
levels to this property: R greater than 0.3; R 
between 0.3 and 0.1; and R less than 0.1. 

4. Whether or not "according to" appears. 
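As an illustration of feature 3 above, the quote ratio and its three levels might be computed as in the following sketch. Several details are assumptions rather than facts from the paper: the Treebank-style quote tokens, counting every non-quote token as a word, the level names, and the treatment of the exact boundary values 0.3 and 0.1.

```python
QUOTE_TOKENS = {'"', "``", "''"}  # assumed tokenization of quotation marks


def quote_ratio(tokens):
    """R = N / M: quoted words over all words in the sentence."""
    inside = False
    quoted = 0
    words = 0
    for token in tokens:
        if token in QUOTE_TOKENS:
            inside = not inside  # each quote mark opens or closes a quotation
            continue
        words += 1
        if inside:
            quoted += 1
    return quoted / words if words else 0.0


def quote_level(r):
    """Map R onto the three levels; boundary handling is an assumption."""
    if r > 0.3:
        return "high"
    if r >= 0.1:
        return "mid"
    return "low"


r = quote_ratio(["``", "We", "agree", "''", "he", "said"])
level = quote_level(r)
```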

Good in other experiments 

1. WordNet synsets (Miller 1990). This property 
was motivated by uses of WordNet synsets in 
Resnik (1993) and Roget categories in Yarowsky 
(1992). 

For abstract classes we need to extend coverage
beyond individual word collocations. Thus, we
experimented with the following synset properties.
Let W be a set of words chosen as collocations in
some manner (see sections 4.2 and 5). A synset
property is whether or not there is a member of the
same synset as a member of W in the sentence
(keeping to the same part of speech).

2. The class assigned by the system to the previous 
sentence, i.e., a 2-gram property. 

3. Whether or not the subject of the main clause 
contains a proper noun. 

4. Whether or not the subject of the main clause 
contains a personal pronoun. 

The preprocessor uses the output of a proper 
name recognizer developed by Jim Cowie at the 
Computing Research Laboratory at NMSU. 

5. A set of binary properties, each mapped to its
own feature: "that" appearing within a window
after the main verb of the main clause; a comma
appearing before the main verb; and a colon
appearing just after the verb. We intend, in
the near future, to treat these the same way
that collocations are treated (see section 4.2 on
collocational properties).

6. The tense of the main verb. 

7. The absence or presence of "to" followed by the 
pattern NPapprox-short within X words (+ or 
-) of the main verb, where 
NPapprox-short = det* adj* noun+ adj*
Example: "The company looked attractive to 
the investors". 



4.1.2 Not found to be useful 

1. The length of the current sentence (above or 
below a threshold). 

This property was meant to be an approxima- 
tion of whether or not the sentence is a complex 
sentence. 

2. The number of sentences in the current article 
(above or below a threshold). 

This is a property of the entire article. The 
intuition is that longer articles are more likely 
to express reactions to events and motivations 
for actions. The difficulty of such properties 
for supervised learning methods is data sparsity, 
since the objects are entire articles rather than 
sentences. 

3. The total number of proper nouns in the ar- 
ticle, another property of articles rather than 
sentences. 

4.2 Collocational Properties 

By collocation we mean a relationship between a 
word and the annotation class. In this study, we con- 
sider a range of collocational patterns, from simple 
co-occurrence to those defined by syntactic
expressions. As in many other studies (e.g., Hearst
1992, Berger et al. 1996, Robin 1996, Golding and
Schabes 1996), our best results were obtained using
collocations based on regular expressions composed
of part-of-speech tags and the root forms of words.
Such collocations can better pinpoint a particular state or
event out of all those referred to in the sentence. 
When one event is being targeted, as in information 
extraction and event categorization, there is often 
noise if the entire sentence is considered. 

Our syntactic collocational patterns are defined
in section 4.2.1 below. These patterns define basic
syntactic structures that are not specific to our
particular problem.

In addition to the syntactic patterns, we also
experimented with two simpler collocational patterns
that are commonly used in NLP. These are presented
in sections 4.2.2 and 4.2.3.

Below, the symbol main_verb-MC refers to the
main verb of the main clause, and NPapprox is
defined as follows: NPapprox = NPapprox-short |
NPapprox-short prep NPapprox.



4.2.1 Syntactic Patterns. 

baseMVColPat = {v | v is main_verb-MC}
E.g., "She believes that Mary is sweet."

baseAdjColPat = {a | a is in the pattern
(main_verb-MC adv* a), where the main verb is
copular}
E.g., "She is/seems happy."

complexMVColPat = {v | v is in the pattern
(main_verb-MC adv* [ NPapprox ] [ "to" ] v), where
v is a main verb}
E.g., "He made her jump."

complexAdjColPat = {a | a is in the pattern
(main_verb-MC adv* [ NPapprox ] [ "to" ] adv* v
adv* a), where v is a main verb and v is copular}
E.g., "He tried to be happy" or "It led him to
possibly be very happy."

We also experimented with noun syntactic pat- 
terns, but did not identify any that improved per- 
formance. 
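To make the pattern notation concrete, here is a minimal sketch of how the simplest adjective pattern, baseAdjColPat, could be matched over a part-of-speech tagged clause. It assumes Penn Treebank tags, a main verb position that has already been found, a caller-supplied `root_of` function standing in for the morphological analyzer, and a short illustrative list of copular verbs. None of these details are taken from the preprocessor itself.

```python
# Assumed for illustration only: a small list of copular verb roots.
COPULAR_ROOTS = {"be", "seem", "become", "appear", "remain"}


def base_adj_col_pat(tagged, main_verb_index, root_of):
    """Return the adjective matched by (main_verb-MC adv* a), or None."""
    verb, _ = tagged[main_verb_index]
    if root_of(verb) not in COPULAR_ROOTS:
        return None
    i = main_verb_index + 1
    # adv*: skip any run of adverbs (Penn Treebank tags RB, RBR, RBS).
    while i < len(tagged) and tagged[i][1].startswith("RB"):
        i += 1
    if i < len(tagged) and tagged[i][1].startswith("JJ"):
        return root_of(tagged[i][0])
    return None


# Toy stand-in for a morphological analyzer.
toy_roots = {"seems": "seem", "believes": "believe"}


def root_of(word):
    return toy_roots.get(word, word.lower())


matched = base_adj_col_pat(
    [("She", "PRP"), ("seems", "VBZ"), ("very", "RB"), ("happy", "JJ")],
    1, root_of)
unmatched = base_adj_col_pat(
    [("She", "PRP"), ("believes", "VBZ"), ("that", "IN"), ("Mary", "NNP")],
    1, root_of)
```

The complex patterns would extend the same scan with an optional NPapprox, an optional "to", and a second main verb.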

4.2.2 Within-5 Patterns. 

One for each of verbs, nouns, and adjectives: 
Within-5 = {w | w appears within 5 words (+ or -) 
of main_verb-MC}. 

4.2.3 Co-occurrence Patterns. 

One for each of verbs, nouns, and adjectives: 
Co-occurrence = {w | w appears anywhere in the 
sentence}. 
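The two simpler patterns can be sketched in a few lines. The sketch assumes Penn Treebank tags selected by prefix ("VB", "NN", "JJ"), words already reduced to root forms, and that the main verb itself is excluded from its own window. These choices and the function names are illustrative assumptions.

```python
def within_5(tagged, main_verb_index, pos_prefix, window=5):
    """Words of one part of speech within `window` tokens of the main verb."""
    lo = max(0, main_verb_index - window)
    hi = min(len(tagged), main_verb_index + window + 1)
    return {word
            for i, (word, tag) in enumerate(tagged[lo:hi], start=lo)
            if i != main_verb_index and tag.startswith(pos_prefix)}


def co_occurrence(tagged, pos_prefix):
    """Words of one part of speech anywhere in the sentence."""
    return {word for word, tag in tagged if tag.startswith(pos_prefix)}


sentence = [("Analysts", "NNS"), ("in", "IN"), ("the", "DT"), ("market", "NN"),
            ("yesterday", "NN"), ("quietly", "RB"), ("said", "VBD"),
            ("that", "IN"), ("the", "DT"), ("big", "JJ"), ("company", "NN"),
            ("expects", "VBZ"), ("higher", "JJR"), ("profits", "NNS")]
main_verb_index = 6  # "said"

near_nouns = within_5(sentence, main_verb_index, "NN")
all_nouns = co_occurrence(sentence, "NN")
```

Here the window drops "Analysts" and "profits", which are more than five tokens from the main verb, while the co-occurrence pattern keeps them.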

5 Selecting Collocations and 
Organizing Information into 
Features 

There are a number of ways to organize collocational
properties, such as those defined above, into
features. To produce the results presented above in
section 3, we systematically varied the type of
organization used.

The patterns defined above are used in combination
with a selection method to identify individual
collocations. The organization of the collocations
into features and the method used to identify the
individual collocations are interdependent. Let there
be $c$ annotation classes, $C_1$ to $C_c$. Let there
be $p$ collocational patterns, $P_1$ to $P_p$ (e.g.,
baseMVColPat is one such pattern).

Then there are two ways to select collocations:
(1) select words that are correlated with class $C_i$
when they appear in pattern $P_j$; these are referred
to as per-class collocations, and are denoted as
WordsCiPj; and (2) select words that, when they
appear in pattern $P_j$, are correlated with the
classification variable across its entire range of
values. These are referred to as over-range
collocations, and are denoted as WordsPj.

5.1 Identification of Per-Class Collocations 

5.1.1 Criterion for Identifying Collocations. 

The method used here and in Ng and Lee (1996)
for forming the collocation sets WordsCiPj is (in
the experiments, we use $k = 0.5$):

$$\mathit{WordsC_iP_j} = \{\, w \mid P(C_i \mid w \text{ in } P_j) > k \,\}$$
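A minimal sketch of this criterion follows. It assumes that the conditional probability is estimated by relative frequency in the training data, with no smoothing and no minimum count, and that the training data are available as (word, pattern, class) triples. The function name and data layout are hypothetical.

```python
from collections import Counter, defaultdict


def per_class_collocations(observations, k=0.5):
    """Select words w with P(class | w in pattern) > k.

    `observations` is an iterable of (word, pattern, cls) triples.
    Returns a dict mapping (cls, pattern) to the selected set of words.
    """
    totals = Counter()
    joint = Counter()
    for word, pattern, cls in observations:
        totals[(word, pattern)] += 1
        joint[(word, pattern, cls)] += 1
    selected = defaultdict(set)
    for (word, pattern, cls), count in joint.items():
        if count / totals[(word, pattern)] > k:
            selected[(cls, pattern)].add(word)
    return dict(selected)


toy_observations = (
    [("believe", "baseMVColPat", "ps")] * 3
    + [("believe", "baseMVColPat", "other")]
    + [("say", "baseMVColPat", "se")] * 2
    + [("say", "baseMVColPat", "other")] * 2
)
words = per_class_collocations(toy_observations, k=0.5)
```

With k = 0.5 a word can be selected for at most one class per pattern, since at most one class can have a conditional probability above one half.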



5.1.2 Organizations 

We experimented with two organizations that are 
in greatest contrast with the over-range organiza- 
tions given below. 

Organization per-class-1 There is one binary feature
for each class $C_i$, whose value is 1 if any member
of any of the sets WordsCiPj appears in the sentence,
$1 \le j \le p$.

Organization per-class-2 For each pattern $P_j$,
define a feature with $c + 1$ values as follows:
for $1 \le i \le c$, there is one value which
corresponds to the presence of a word in WordsCiPj.
Each feature also has a value for the absence of any
of those words.
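The difference between the two per-class organizations may be easier to see in code. The sketch below assumes that each sentence has already been reduced to a mapping from pattern name to the set of words found in that pattern, and that the collocation sets are stored under (class, pattern) keys. When words of more than one class match the same pattern, it takes the first class in the given order, which is an arbitrary choice made only for this illustration.

```python
def per_class_1(matches, words_by_class_pattern, classes):
    """One binary feature per class: 1 if any collocation of that class,
    in any pattern, is present in the sentence."""
    features = {}
    for cls in classes:
        features[cls] = int(any(
            matches.get(pattern, set()) & words
            for (c, pattern), words in words_by_class_pattern.items()
            if c == cls))
    return features


def per_class_2(matches, words_by_class_pattern, classes, patterns):
    """One feature per pattern with c + 1 values: a class name, or "none"."""
    features = {}
    for pattern in patterns:
        value = "none"
        found = matches.get(pattern, set())
        for cls in classes:
            if found & words_by_class_pattern.get((cls, pattern), set()):
                value = cls
                break
        features[pattern] = value
    return features


collocations = {("ps", "baseMVColPat"): {"believe"},
                ("se", "baseMVColPat"): {"declare"}}
sentence_matches = {"baseMVColPat": {"believe"}}
pc1 = per_class_1(sentence_matches, collocations, ["ps", "se"])
pc2 = per_class_2(sentence_matches, collocations, ["ps", "se"],
                  ["baseMVColPat", "baseAdjColPat"])
```

per-class-1 yields c binary features in total, while per-class-2 yields p features with c + 1 values each.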



5.2 Identification of Over-Range 
Collocations 

5.2.1 Criterion for Identifying Collocations. 

In this alternative, the members of the collocation
sets WordsPj are identified as follows. $G^2$ (or
another goodness-of-fit test) is applied to identify
words $w$ such that, when $w$ appears in pattern
$P_j$, the model of independence between the
classification variable and $w$ has a poor fit.

Organization over-range-1

This organization is used in positional features such
as in Gale et al. (1992a) and Leacock et al. (1993).
Define one feature per pattern $P_j$, with
$|\mathit{WordsP_j}| + 1$ values, one value for each
word in WordsPj (i.e., each word selected for pattern
$P_j$ using $G^2$ as described above). Each feature
also has a value for the absence of any word in
WordsPj.

Organization over-range-2

This organization is commonly used in NLP. Define
a binary feature for each word in each set WordsPj,
$1 \le j \le p$.
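The two over-range organizations can be sketched in the same style. As before, the data layout (a mapping from pattern name to the words found in that pattern) and the function names are assumptions. When several selected words match one pattern, over-range-1 here simply takes the alphabetically first, which is a tie-breaking rule invented for this example.

```python
def over_range_1(matches, words_by_pattern):
    """One feature per pattern; its value is the matched word, or "none"."""
    features = {}
    for pattern, words in words_by_pattern.items():
        found = sorted(matches.get(pattern, set()) & words)
        features[pattern] = found[0] if found else "none"
    return features


def over_range_2(matches, words_by_pattern):
    """One binary feature for every selected word of every pattern."""
    return {(pattern, word): int(word in matches.get(pattern, set()))
            for pattern, words in words_by_pattern.items()
            for word in words}


selected_words = {"baseMVColPat": {"believe", "say"},
                  "baseAdjColPat": {"happy"}}
sentence_matches = {"baseMVColPat": {"say"}}
or1 = over_range_1(sentence_matches, selected_words)
or2 = over_range_2(sentence_matches, selected_words)
```

Note how over-range-2 produces one variable per word, most of which are 0 for any given sentence.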

6 Discussion 

As can be seen in table 2, the best results are ob- 
tained with the per-class-2 organization, which is not 
commonly used in NLP. 

Notice in table 2 that good results are obtained 
with the per-class organizations and the syntactic 
patterns. But poorer results are obtained with the 
per-class organizations and the simpler collocational 
patterns. The simpler collocational patterns can
give relatively good results; they do so when used
with the over-range organizations.

Table 3: Positive and False Positive Occurrences of
Collocational Features using Organization PC-1
(averages across features)

N=255                 Total Positive Instances    False Positive Instances
Co-occurrence         84                          52
Within-5              73                          41
Syntactic patterns    21                          7



In comparison to the more restrictive (syntac- 
tic) patterns, the less restrictive (co-occurrence and 
within-5) patterns identify properties that occur 
more frequently, but do not as strongly select one 
of the classes. To see this, consider table 3, which 
contains frequency information for one of the folds of 
the experiments whose results are in table 2, row 3. 
The first column shows that the total number of pos- 
itive instances is much higher for the less restrictive 
collocational patterns than for the more restrictive 
ones. The second column shows that the number 
of false positives (e.g., a ps collocation that appears 
with a class other than ps) is also much higher for 
the less restrictive collocational patterns. 
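The contrast is easier to see as a proportion. The short calculation below divides the false positive count by the total positive count for each row of table 3. The derived ratio is not reported in the table itself and is offered here only as a reading aid.

```python
# (total positive instances, false positive instances) from table 3.
table3 = {
    "Co-occurrence": (84, 52),
    "Within-5": (73, 41),
    "Syntactic patterns": (21, 7),
}

false_positive_share = {
    name: false_pos / total for name, (total, false_pos) in table3.items()
}

for name, share in false_positive_share.items():
    print(f"{name}: {share:.2f}")
```

Roughly six in ten positive instances are false positives for the co-occurrence patterns, against about one in three for the syntactic patterns.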

Organization per-class-1 admits the least amount 
of interaction between the words in the collocation 
sets and the other features: all the collocation words 
are grouped into one value of one feature. The less 
restrictive properties benefit from the organizations 
that permit more interaction. In interaction with
other features, these properties become stronger
indicators of a specific class.

With the over-range organizations, the syntactic 
patterns lead to many variable values for which there 
are seldom positive instances (since even grouped to- 
gether, the frequency is low, as table 3 shows). The 
experiments presented in table 2 demonstrate that 
having many variables that contribute no evidence 
for most instances can harm accuracy. Methods have 
been proposed for handling low-frequency, highly in- 
dicative properties. One is to consider only collo- 
cations that occur above some threshold frequency 
(e.g., Smadja 1993 and Ng and Lee 1996). How- 
ever, it is desirable to be able to retain these words, 
because when they occur, they are good indicators. 
Hearst (1992) addresses this problem by considering 
only positive evidence. Similarly, Yarowsky (1993) 
considers only the single best piece of evidence that 
occurs. Another way to handle this problem is the 
one presented here: by identifying the collocations 
using the per-class method, one is able to retain 
low-frequency, highly indicative properties by con- 
solidating them into fewer variables. 

7 Conclusion 

This paper presented the results of a study in which 
a fully automatic system for event categorization was 
developed and tested. The system was developed using
a recent method for formulating a probabilistic
model to use in classification. Although the
categorization task is complex, 10-fold
cross-validation results were presented, showing good
performance: 75% accuracy, which is a 44% improvement
over the lower bound. Some manual tuning of features
raises the results above 78%.
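Assuming the 44% figure is meant as a relative gain over the 52% lower bound, which is the reading that matches the numbers, the arithmetic is:

$$\frac{0.75 - 0.52}{0.52} \approx 0.44$$

Using the unrounded best accuracy from table 2, 0.7468, gives $(0.7468 - 0.52)/0.52 \approx 0.436$, which also rounds to 44%.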

Our focus in this paper was feature selection. 
Many different contextual properties were described 
and evaluated. The features evaluated in this study 
would be applicable to other event categorization 
and information extraction tasks for which one event 
out of many in a sentence is targeted, or for which 
the classifications are highly context dependent. In 
future work, we plan to investigate including the ad- 
ditional features that Siegel (1997) and Klavans & 
Chodorow (1992) found to be important for state 
versus event classification. 

In addition to identifying relevant contextual 
properties, contrasting approaches to organizing col- 
locational properties into features were defined and 
systematically tested. The results suggest that a 
grouping of features allowing fewer interactions is 
desirable for low frequency, highly indicative prop- 
erties. On the other hand, the results suggest that
higher-frequency, less indicative properties yield
better results when the information is organized so that
a greater degree of interaction among variables can 
be exploited. While these findings were obtained 
using a particular method for model selection, they 
should be equally applicable to any classification sys- 
tem that allows interactions among features and sup- 
ports the types of features described in this study. 